Abstract 


This document describes a solution to a problem in the automatic content scoring of the 
multilingual character-by-character highlighting item type. This solution is language independent 
and represents a significant enhancement. This solution not only facilitates automatic scoring but 
plays an important role in clustering students’ responses; consequently, it has a nontrivial impact 
on the refinement of the items and/or their scoring guidelines. Furthermore, though designed for 
a specific problem, the proposed solution is general enough for any educational task that can be 
transformed into a sequential one. To name a few: It can be used for a set of actions expected 
from a student in simulations or learning trajectories as projected by a teacher, inside an 
intelligent tutoring system, or even in a game—or it can simply be used for a set of student 
clicks, button selections, or keyboard hits expected to reach a correct answer. This solution 
provides flexibility for existing automatic-scoring techniques and potentially could provide more 
flexibility if coupled with statistical data-mining techniques. 

Key words: multilingual automated scoring, large-scale assessment, sequential tasks, prefix-infix 
omission and insertion, multilingual character or grapheme sequences, sequence alignment and 
clustering, bio-NLP 
There is renewed and growing interest from researchers and assessment organizations in 
developing types of items that are not multiple choice, whether for monolingual or multilingual 
assessment. However, scoring such items manually takes considerable time and money, and the 
existing methods for scoring them automatically have limitations and need improvement. Therefore, 
we looked at a particular item type where we felt we could address these limitations and 
improve scoring accuracy, with the goal of a solution having broader applications. 

This paper is focused on a content-based, character-based highlighting item type with the 
following specification (Table 1 shows an example in English and Spanish). 

For each item, the test taker is given a stimulus and directions to highlight evidence in the 
stimulus to respond to a certain prompt. A response has a score of true (1) or false (0) or a set of 
score points that does not necessarily consist of a binary set. The content to be scored is 
represented in terms of a set of correct responses ranging from the minimum evidence expected 
to the maximum evidence. In the example, the minimum evidence is given in terms of four 
pieces: less than average, fell 4% between 2007-2009, fell 7% in 2004, and low due to 
unfavorable economic conditions. The maximum is a whole section that contains these four 
pieces. Seen as a set of characters, all correct responses are subsets of the maximum evidence. 
The complement of the maximum, given the character space to be the stimulus, is what is 
considered unacceptable. 

The scoring rules are propositional Boolean logical formulae defined using the minimum 
(these are called match predicates) and the unacceptable (mismatch predicates). A scoring rule is 
a combination of match predicates and mismatch predicates, where a match predicate, P, is true 
if and only if P is completely highlighted and a mismatch predicate, Q, is true if and only if any 
part of Q (any character) is highlighted. In other words, each scoring rule is a function, expressed 
in propositional logic, from the set of combinations of correct responses to the set of score points 
(for this particular item type, the empty response is always false). The scoring rule in Table 1 is 
an example. 

The rule in the example says that a student’s response is true if the pieces of evidence (1), 
(2), and (4) are completely highlighted and nothing outside the maximum is highlighted or if the 
pieces of evidence (1), (3), and (4) are completely highlighted and nothing outside the maximum 
is highlighted. For a true response, a student can highlight up to a whole section that includes the 
four pieces of evidence defined in the minimum but nothing outside that section. 
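The rule just described can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a response is stored as the set of highlighted character offsets in the stimulus, and the names `match`, `mismatch`, `score_item1`, and all offsets below are hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # half-open character range [start, end) within the stimulus


def match(highlighted: Set[int], piece: Span) -> bool:
    """Match predicate: true only if every character of the piece is highlighted."""
    return set(range(*piece)) <= highlighted


def mismatch(highlighted: Set[int], region: Set[int]) -> bool:
    """Mismatch predicate: true if any character of the region is highlighted."""
    return bool(highlighted & region)


def score_item1(highlighted: Set[int], pieces: Dict[int, Span],
                maximum: Span, stimulus_length: int) -> int:
    """Pieces (1, 2, 4) or (1, 3, 4) fully highlighted, and nothing outside the maximum."""
    outside_max = set(range(stimulus_length)) - set(range(*maximum))
    evidence = (
        match(highlighted, pieces[1])
        and match(highlighted, pieces[4])
        and (match(highlighted, pieces[2]) or match(highlighted, pieces[3]))
    )
    return int(evidence and not mismatch(highlighted, outside_max))


# Hypothetical offsets, for illustration only (not taken from the real stimulus).
PIECES = {1: (12, 20), 2: (25, 35), 3: (40, 50), 4: (60, 75)}
MAXIMUM = (10, 90)
STIMULUS_LENGTH = 100
```

Under this strict rule, a response that misses a single character of piece (1) or includes a single character outside the maximum scores 0, which is the behavior the rest of the paper sets out to relax.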
Table 1 

Example of a Highlighting Item 

Item 1 (English version) 

Stimulus: (a web page is given) 

Directions: Look at the web page. Highlight information on the page to answer the following 
question. 

Question: What does the web page say about the birth rate in USA in comparison to other 
countries? 

Minimum response: 

1. Less than average 

2. Fell 4% between 2007-2009 

3. Fell 7% since 2007 

4. Low due to unfavorable economic conditions 

Maximum response: Entire section starting with “How birth rate” and ending with “compared to 
others” 

Scoring rule: 1 if and only if (match(1) AND match(2) AND match(4)) OR (match(1) AND match(3) 
AND match(4)) AND NOT mismatch(max^c); otherwise 0, where max^c is the complement of max (in 
terms of set theory) 

Item 1 (Spanish version) 

Stimulus: (una página web es suministrada) 

Indicaciones: Lee la página web. Resalte la información en la página que responda la siguiente 
pregunta. 

Pregunta: ¿Qué dice la página web sobre la tasa de natalidad en USA en comparación con otros 
países? 

Respuesta mínima: 

1. Más que el promedio 

2. Cayó 4% entre 2007-2009 

3. Cayó 7% desde 2007 

4. Bajo debido a condiciones económicas desfavorables 

Respuesta máxima: La sección comienza con “como el índice de natalidad” y termina con 
“comparada con otros” 

Regla de calificación: 1 if and only if (match(1) AND match(2) AND match(4)) OR (match(1) AND 
match(3) AND match(4)) AND NOT mismatch(max^c); otherwise 0, donde max^c es el complemento de 
max (en términos de la teoría de conjuntos) 

In large-scale computer-based assessment, this item type is introduced and highlighting 
might be enabled one character at a time to allow for multilingual highlighting. In some cases, 
this item type is administered in 99 natural languages. This character-based highlighting implies 
that a test taker’s response can unintentionally either have additional characters at the beginning 
or end of a response or missing characters at the beginning, middle, or end. 
For example, in the item illustrated in Table 1, a test taker might miss some characters 
and end up highlighting something like ss than average and fell 4% betwe 007-200. Thus, Le is 
missing from (1) and en, 2, and 9 are missing from (2). A student might also highlight the section, 
“How birth rate ... compared to others. How,” in which H, o, w were highlighted as additional 
characters beyond the maximum. A student can add characters at the beginning or end but not in 
the middle in this particular item type. Table 2 illustrates the various possibilities associated with 
this item type. We also name each possibility; the names are listed in the first column of Table 2. 


Table 2 

Omissions and Insertions 

Name                Example 
Prefix-insertion    By less than average 
Prefix-omission     ss than average 
Infix-insertion     N/A for this item type 
Infix-omission      Fell 4% betw 2007-2009 
Suffix-insertion    Less than average in comp 
Suffix-omission     Less than ave 


For each highlighted response, any combination of these possibilities could occur, similar 
to what could occur with a student’s written free-text response (except there is no transposition, 
i.e., no cases where two characters are interchanged, as in cieling versus ceiling, and no insertions 
in the middle). 

Being symbolic logic rules, as designed, scoring rules will score responses with 
additional or missing characters as 0 or false. However, test developers and human raters want to 
consider these responses true or with a score point 1. The question, obviously, is what would be 
the admissible maximum number of characters missing or inserted in each case. The answer to 
this question might vary from one item to another and one language to another. However, as 
shown later, it is not just about the number of characters. 

To summarize the first task that we face: 

For each item, given a set of correct responses and a set of unseen responses for a 
particular multilingual content-based, character-based highlighting item, the aim is to 
automatically score the unseen responses. We will refer to this task as prefix-infix-suffix 
omissions and insertions (for this particular item type, as mentioned, there is no infix insertion, 
but the aim is a general solution that could work for this and other tasks). 

Additional scoring concerns exist. An initial look at the responses revealed that a 
nontrivial number of students, within the same language or across languages, responded in a very 
similar false manner. Unlike the previous issue, this one does not, in general, seem to be due to 
unintentional character-by-character highlighting errors. Hence, in some 
cases it made us question the suitability of the wording in the prompt or the stimulus (such as 
the passage), its interpretation by students, and the scoring rules; that is, why one response is 
considered true (and so can be given a score point of 1 according to the scoring rules) while 
another that seems to be logically suitable is considered false and can only be given a score of 0. 
The issues are particularly complex and challenging when working with different languages, 
countries, cultures, and background knowledge. 

Hence, the second task or question that we need to consider is: Given the set of 
responses, can we inform the item developers and enhance the item design, including scoring 
rules developed through what we can automatically mine in the responses? An implicit subtask 
then is mining the responses automatically and finding similar responses in one language—and 
across languages if possible. 

In the following, we first describe a language-independent methodology to tackle the 
above two tasks. We then describe some related work. Next we outline the implementation and 
present the results of the implementation to items and responses written in four natural 
languages, showing (a) this is a nontrivial enhancement to existing automatic scoring and (b) this 
solution helps us cluster responses and further enhances the rubrics or the scoring rules—hence, 
the scoring. Then, we discuss our analysis for the items and corresponding responses and its 
implications. We conclude with next steps. 

Method: A Language-Independent General Solution 
Longest Common Subsequence 

The basic solution we propose for the above tasks is based on calculating an approximate 
match or similarity measure between two textual sequences. For the first task, a prefix-infix- 
suffix omissions and insertions task, a similarity measure between a non-null unseen response 
and one that is true is calculated. For the second task, any two non-null responses, whether given 
in the scoring rules or produced by students, are compared via the same similarity measure. 
The idea is simple: The higher the similarity (or the lower the Dissimilarity) between a 
test taker’s response and a correct response, the more likely the response is true. In fact, the 
higher the similarity (or the lower the Dissimilarity) between any two non-null responses, $R_i$ and 
$R_j$, where $R_j$ is more likely to be true, the more likely $R_i$ is true. 

In this study, we define the similarity measure between Text 1 and Text 2 to be the longest 
common subsequences (LCSs; Cormen, Leiserson, Rivest, & Stein, 2001) between them. 
However, the methodology applied will still hold using other similarity measures. 

This particular approach, the use of an LCS to compare test takers’ responses, is 
motivated by biological applications, particularly those used in DNA or gene sequencing, 
bioinformatics, or genetics. 

In biology, a strand of DNA consists of molecules called bases: adenine (A), 
guanine (G), cytosine (C), and thymine (T). A strand of DNA is represented as a string over the 
finite set {A, G, C, T}, where a string is a finite sequence of symbols that are chosen from a set 
or alphabet. Two strands of DNA, D1 and D2, are similar if they have common bases that appear 
in the same order but not necessarily consecutively. Hence, this can be seen as generating a third 
strand, D3, consisting of these common bases. The longer D3 is, the more similar D1 and D2 are. 
Comparing two strands of DNA in order to verify how close two organisms are is very similar to 
our task. Figure 1 shows two strands of DNA for two different organisms. When compared to 
each other, a resulting set of common molecules is shown in the third strand in the figure. 
Figure 2 shows the same two strands with another longer set of common molecules. By finding 
similarities between sequences, scientists can infer the function of newly sequenced genes, 
predict new members of gene families, and explore evolutionary relationships. 



Figure 1. DNA sequences and commonalities. 
Figure 2. A larger set of ordered commonalities. 

In general, beyond the DNA and gene alphabet (only 4-20 letters), this idea can be 
applied to any language. An alphabet of a language can be any set from which the strings of the 
language may be formed. This set can be finite or infinite, though in most cases it is finite. In our 
case, it makes sense to say that an alphabet, whether in a formal or natural language, is the set of 
symbols, letters, or tokens from which the strings of the language may be formed. In other 
words, a string is a finite sequence or ordered list of elements of an alphabet. A subsequence of a 
string or sequence S is an ordered subset (the same order used for the sequence) of the sequence. 

A test taker’s response can be seen as a strand of DNA or a string over an alphabet. The 
string, in this case, is a finite sequence of graphemes where a grapheme is the smallest semantically 
distinguishing unit in a written natural language. A grapheme does not carry meaning by itself. 
Graphemes include alphabetic letters, Chinese characters, numerical digits, punctuation marks, 
and the individual symbols of any of the world’s writing systems. In the remainder of this 
document, we will use the term grapheme instead of character because we are dealing with a 
multilingual task where student responses can contain numerical digits or symbols. In our case, a 
subsequence of a string S in any natural language is a set of graphemes that appear in order—the 
same order as in the writing or script of the language (e.g., in Spanish from left to right, and in 
Arabic or Hebrew from right to left), but not necessarily consecutively. In a string such as naya, 
any of na, ay, aa, ya, ny, nay, or nya is an example of subsequences, but not a string such as an or 
aan. 
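The definition of a subsequence can be checked mechanically. The sketch below is illustrative only; the function name `is_subsequence` is hypothetical, and it assumes each grapheme is a single Python character (a Unicode code point), which does not hold for every writing system.

```python
def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if the graphemes of candidate appear in text in the same order."""
    remaining = iter(text)
    # `g in remaining` consumes the iterator up to the first occurrence of g,
    # so later graphemes can only match positions further along in text.
    return all(g in remaining for g in candidate)
```

For the string naya, the function accepts na, ay, aa, ya, ny, nay, and nya, and rejects an and aan, in line with the examples in the text.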

Given two strings or sequences, a common subsequence is one that appears in both, as 
shown in Figures 1 and 2. An LCS is a common subsequence that has a maximum size or length 
in terms of number of graphemes. For example, two strings, S1 = jjjsslpljlppjppslppspjljj and S2 
= sjsssspjjllpjsspppllpps, have an LCS of jsspllpppplpps. In general, the similarity between two 
ordered sequences, as defined above, is not necessarily unique. However, its size is. Consider, for 
example, the two sequences NAY and NYA; there are two LCSs, namely, ny and na. The size or 
length is 2. Table 3 shows examples of pairs of texts with some LCSs (this particular 
implementation for results in Table 3 includes white spaces as members of the alphabet). 


Table 3 

Examples of Pairs of Texts With LCS 

Text 1: the Organisation for phony academic publications 
Text 2: the Organization for phony and pub 
LCS: the Organiation for phony ad pub 
Length: 32 

Text 1: l'Organisation de faux publications academiques 
Text 2: l'Organzation de pu|lication aca 
LCS: l'Organation de pulication aca 
Length: 31 

Text 1: Feliz Ano Nuevo! Espero que todos esten muy bien. 
Text 2: Espero verles a todos. Hasta muy pronto. 
LCS: Espero e todos st muy n. 
Length: 24 

(A fourth example pair, in a non-Latin script, is illegible in the source.) 


To find an LCS, we can generate all subsequences and select one with maximum length. 
Actually, in some cases, the length might be all we are interested in, but as this is anticipated to 
be a general solution for many item types and additional educational tasks, we want to find the 
subsequences, too. 

Figures 3 and 4 show the two algorithms for finding LCSs and for finding the length of an 
LCS without having to calculate the subsequence, respectively, as described in Cormen et al. 
(2001). 
Let $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ be sequences and let 
$Z = \langle z_1, z_2, \ldots, z_k \rangle$ be any LCS of $X$ and $Y$. 

1. If $x_m = y_n$, then $z_k = x_m = y_n$ and $Z_{k-1}$ is an LCS of $X_{m-1}$ and $Y_{n-1}$. 

2. If $x_m \neq y_n$, then 

a. If $z_k \neq x_m$, then $Z$ is an LCS of $X_{m-1}$ and $Y$. 

b. If $z_k \neq y_n$, then $Z$ is an LCS of $X$ and $Y_{n-1}$. 

Figure 3. Calculating an LCS recursively. 


Assume that $\mathit{length}(i, j)$ is the length of an LCS of sequences $X_i$ and $Y_j$. 

$$
\mathit{length}(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\mathit{length}(i-1, j-1) + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j \\
\max(\mathit{length}(i, j-1), \mathit{length}(i-1, j)) & \text{if } i, j > 0 \text{ and } x_i \neq y_j
\end{cases}
$$

Figure 4. Calculating the length of an LCS recursively without producing the 
subsequences. 
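The recurrences in Figures 3 and 4 are usually evaluated bottom-up with a table rather than by literal recursion. The sketch below is one such tabulation, given as an illustration rather than as the implementation used in this study; the names `lcs_table` and `lcs` are hypothetical, and each grapheme is assumed to be one Python character.

```python
from typing import List


def lcs_table(x: str, y: str) -> List[List[int]]:
    """Fill length[i][j], the LCS length of the prefixes x[:i] and y[:j] (Figure 4)."""
    m, n = len(x), len(y)
    length = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i][j - 1], length[i - 1][j])
    return length


def lcs(x: str, y: str) -> str:
    """Recover one LCS by walking the table backward (the cases of Figure 3)."""
    length = lcs_table(x, y)
    i, j = len(x), len(y)
    graphemes = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            graphemes.append(x[i - 1])  # last graphemes agree: part of the LCS
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1  # an LCS of x[:i-1] and y[:j] is just as long
        else:
            j -= 1  # an LCS of x[:i] and y[:j-1] is just as long
    return "".join(reversed(graphemes))
```

Because an LCS is not necessarily unique, `lcs` returns only one of the possible subsequences; which one depends on the tie-breaking in the backward walk. The table takes time and space proportional to the product of the two lengths.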


In summary, we define a similarity between two texts (two test takers’ responses, in this 
case), $T_1$ and $T_2$, each seen as a sequence of graphemes, to be 

$$\mathit{Similarity}(T_1, T_2) = \text{LCS of graphemes between } T_1 \text{ and } T_2,$$

where the white spaces and punctuation marks can be included or excluded depending on needs. 
This function is more meaningful if we know the total number of graphemes in both $T_1$ 
and $T_2$. Let $[H]$ denote the number of graphemes in a sequence of graphemes $H$. We calculate 
more indicative measures: 

$$\mathit{NSimilarity}(T_1, T_2) = \frac{[\mathit{Similarity}(T_1, T_2)]}{[T_1] + [T_2]}$$

or 

$$\mathit{Dissimilarity}(T_1, T_2) = ([T_1] - [\mathit{Similarity}(T_1, T_2)]) + ([T_2] - [\mathit{Similarity}(T_1, T_2)]).$$
Obviously, the smaller the Dissimilarity measure, the greater the similarity between the two 
sequences of graphemes. Again, normalizing the above measure using the total number of 
graphemes in the two sequences gives: 

$$
\begin{aligned}
\mathit{NDissimilarity}(T_1, T_2)
&= \frac{([T_1] - [\mathit{Similarity}(T_1, T_2)]) + ([T_2] - [\mathit{Similarity}(T_1, T_2)])}{[T_1] + [T_2]} \\
&= \frac{[T_1] + ([T_2] - 2[\mathit{Similarity}(T_1, T_2)])}{[T_1] + [T_2]} \\
&= 1 - \frac{2[\mathit{Similarity}(T_1, T_2)]}{[T_1] + [T_2]} \\
&= 1 - 2\,\mathit{NSimilarity}(T_1, T_2).
\end{aligned}
$$


We recommend normalizing the measures because the application is going to be over a 
varied number of responses, items, and natural languages. Note that we can define 
$\mathit{Dissimilarity}(T_1, T_2)$ as $[T_1] + [T_2] - [\mathit{Similarity}(T_1, T_2)]$, which makes 
$\mathit{NDissimilarity}(T_1, T_2)$ simply $1 - \mathit{NSimilarity}(T_1, T_2)$. However, taking different combined 
measures into consideration is helpful and empirical evidence will confirm their adequacy. Having defined an approximate 
match or similarity measure, its use to solve our concerns is described in the following section. 
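As a worked illustration of the four measures, the sketch below computes them for one pair of texts using the first definition of Dissimilarity (both leftover parts summed). The function names and the dictionary layout are hypothetical, white spaces are kept as graphemes, and each grapheme is assumed to be one Python character.

```python
from typing import Dict


def lcs_length(x: str, y: str) -> int:
    """Length of an LCS of x and y, keeping only two rows of the table."""
    prev = [0] * (len(y) + 1)
    for gx in x:
        curr = [0]
        for j, gy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if gx == gy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def measures(t1: str, t2: str) -> Dict[str, float]:
    """[Similarity], NSimilarity, Dissimilarity, and NDissimilarity of two non-null texts."""
    total = len(t1) + len(t2)
    if total == 0:
        raise ValueError("both texts are empty; the measures are defined for non-null responses")
    sim = lcs_length(t1, t2)
    dissim = (len(t1) - sim) + (len(t2) - sim)
    return {
        "similarity": sim,
        "nsimilarity": sim / total,
        "dissimilarity": dissim,
        "ndissimilarity": dissim / total,
    }
```

For "quick dirty work" (16 graphemes) and "quck drty wok" (13 graphemes), the shorter text is itself a subsequence of the longer one, so the LCS length is 13, Dissimilarity is 3, NSimilarity is 13/29, and NDissimilarity is 3/29, which equals 1 - 2(13/29).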

Scoring Sequences 

The first task is to automatically score unseen responses given a set of correct responses. 
Consider the set of non-null test takers’ responses where each response is seen as a sequence of 
graphemes (denote this set by ResUnseen) and consider the set of distinct responses that are 
also correct or true where each response is seen as a sequence of graphemes (denote this set by 
ResTrue). $\forall\, xu_i \neq \emptyset$ such that $xu_i \in \mathit{ResUnseen}$ and $\forall\, yt_j$ such that 
$yt_j \in \mathit{ResTrue}$, calculate a four-tuple: 

$$A_{ij} = \langle \mathit{LCS}(xu_i, yt_j),\ \mathit{Length\_of\_LCS}(xu_i, yt_j),\ \mathit{Dissimilarity}(xu_i, yt_j),\ \mathit{NDissimilarity}(xu_i, yt_j) \rangle.$$
For each item, an m-by-n matrix, Z, is obtained where m is the number of non-null 
responses and n is the number of distinct true responses: 

$$
Z = \begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \cdots & A_{mn}
\end{pmatrix}
$$

One way of reducing the dimensions of Z is to collapse similar responses into uniform or 
canonical representations, such as removing white spaces and punctuation marks. This means 
that only the uniform representations of $xu_i$ and $yt_j$ need to be compared. 

Figure 5 shows the comparison of each non-null response and each true response. $Z_{i*}$ 
denotes row $i$ in the Z matrix. For the $i$th non-null response, first find the minimum Dissimilarity, 
$A_i = \min_j \{\mathit{NDissimilarity}(yt_j, xu_i)\}$. 





Figure 5. Compare each unseen to each true. (Diagram: unseen responses 1 through M, each 
compared with true responses 1 through N.) 


The same logic applies if we were to choose Similarity, but in that case we will choose 
the maximum instead of minimum. Thus, $A_i$ is the minimum $d_{ij}$ over all four-tuples in row $Z_{i*}$. 
For simplicity, we will denote the arguments of $A_{ij}$ with $\langle a_{ij}, b_{ij}, c_{ij}, d_{ij} \rangle$. Hence, $A_i$ is of the form 
$d_{ik}$ for some $k$ between 1 and $n$, that is, for some true response. Hence, Z’s dimensions get 
reduced to m-by-1, and an m-by-1 column vector is obtained, as seen in Figure 6. Note that the 
selection of the minimum or maximum is the most intuitive and that is what we use in this study. 
However, it is not the only option; one might, for example, select a linear function of the various 
(dis)similarities with weights corresponding to the number of repeated true responses. 
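The reduction from the m-by-n matrix to an m-by-1 column can be sketched as follows. This is an illustration under stated assumptions, not the study's code: only the NDissimilarity component of each four-tuple is kept, the names `ndissimilarity` and `closest_true` are hypothetical, and the sample responses are borrowed from Table 2.

```python
from typing import List, Tuple


def lcs_length(x: str, y: str) -> int:
    """Length of an LCS of x and y (two-row tabulation)."""
    prev = [0] * (len(y) + 1)
    for gx in x:
        curr = [0]
        for j, gy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if gx == gy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def ndissimilarity(t1: str, t2: str) -> float:
    """([T1] + [T2] - 2 [Similarity]) / ([T1] + [T2]) for two non-null texts."""
    total = len(t1) + len(t2)
    return (total - 2 * lcs_length(t1, t2)) / total


def closest_true(unseen: List[str], true: List[str]) -> List[Tuple[int, float]]:
    """For each unseen response, return (k, d_ik): the nearest true response and its distance."""
    column = []
    for xu in unseen:
        row = [ndissimilarity(xu, yt) for yt in true]  # d_i1 ... d_in, one row of Z
        k = min(range(len(row)), key=row.__getitem__)  # index of the row minimum
        column.append((k, row[k]))
    return column


unseen_responses = ["ss than average", "Less than ave"]
true_responses = ["Less than average", "Fell 4% between 2007-2009"]
column_vector = closest_true(unseen_responses, true_responses)
```

Here both unseen responses are nearest to the first true response, with minimum NDissimilarity 2/32 and 4/30 respectively.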


$$
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \cdots & A_{mn}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
A_1 = d_{1k} \\
\vdots \\
A_m = d_{mk}
\end{pmatrix}
$$

Figure 6. m-by-n matrix to m-by-1 matrix. 


However, at least for each item, the aim is to have one threshold or tolerance, $T$, for 
Dissimilarity with what is considered correct or true, such that if $\exists\, d_{ij} \le T$, then the score of 
$xu_i$ is true. Hence, the m-by-1 column vector will guide our choice of the threshold, $T$. In the 
following, we describe how. 


Selecting a Threshold 

One option is just to set a threshold based on statistics related to the number of 
graphemes over all responses or the minimum and maximum correct responses (when these are 
available) where its selection is general across all items. A crude option, for instance, 
independent of any data, is to select an arbitrary threshold such as 0.5 or use some heuristics over 
NDissimilarity. For instance, select various thresholds that satisfy an arithmetic progression of 
the form $r_n = r_m + (n - m)d$ over all items across all languages or a subset of items over a subset 
of languages. For example, starting with a threshold of 0.1 and difference $d$ of 0.1, the set of 
thresholds to consider would be {0.1, 0.2, ..., upper bound of NDissimilarity}. 
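The arithmetic progression of candidate thresholds can be generated as below. This is only a sketch: the function name and defaults are hypothetical, and the default upper bound of 1.0 assumes the first definition of NDissimilarity, which reaches 1 when two texts share no graphemes.

```python
from typing import List


def candidate_thresholds(start: float = 0.1, step: float = 0.1,
                         upper: float = 1.0) -> List[float]:
    """Thresholds start, start + step, ... that do not exceed upper."""
    # The small epsilon guards against floating-point error in the division.
    count = int((upper - start) / step + 1e-9) + 1
    # Rounding removes binary noise such as 0.30000000000000004.
    return [round(start + k * step, 10) for k in range(count)]
```

Each candidate can then be tried per item, per language, or over any subset, and the resulting scores compared.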

Another option is to use a set of training or field-test data to guide our selection of a 
tolerance of Dissimilarity that can be applied on an unseen set of responses. Whether human 
scoring takes place for the training dataset or not will have an impact on threshold selection. 
Figure 7 shows that if human scores are given for each item, then we use the human scores to 
learn a threshold of (dis)similarity. If not, then one option might be to define a one-to-one 
correspondence between a response produced by a candidate, C, and C’s performance over a 
subtest of items belonging, for example, to the same stimulus or the whole test, in order to learn 
a threshold. In any of these cases, a machine-learning algorithm also can be used to learn a 
threshold, given item-related or response-related attributes such as LCSs, Dissimilarity, and so 
on. Among the mentioned approaches for threshold selection, those using a training dataset, that is, a 
set of student responses labeled or scored manually, are considered supervised approaches, while 
the rest are unsupervised approaches for learning or selecting a threshold. 

In this report, we concentrate on this first case, that is, where human scores are given 
with the responses. The other case where human scores are not given will be described in a 
separate study. 

In Figure 7, a preprocessing step is introduced. This step might simply be extracting 
relevant information from the log files or database used to store student responses or it could be 
removing white spaces and punctuation marks from student responses to obtain a uniform 
representation for a response or a sequence, and so on. In the evaluation section of this paper, the 
preprocessing step for this study will be specified. 

Figure 8 summarizes the process by which a threshold is selected given a set of student 
responses with their corresponding human scores (used as a training dataset). For each item, once 
the above matrix Z is obtained and reduced to the m-by-1 vector of $A_i$s, as described above, then 
$\forall\, xu_i \neq \emptyset$ in the training dataset, the question is whether $xu_i$ is scored manually as true or not. 

If it was true, then denote $A_i = A_{iT}$, and if not, then denote $A_i = A_{iF}$. Find $K$ such that 
$K = \max_i \{A_{iT}\} \wedge \forall A_{iF}.\ A_{iF} > K$, that is, $K$ is the largest minimum Dissimilarity among 
the responses scored manually as true, and every response scored manually as false lies farther 
from the true responses than $K$. If such a $K$ exists, then set the threshold $T = K$. In practice, in this 
case, $T$ is really one of the $A_{iT}$s. If such a $K$ does not exist, then select a $T$ such that the number 
of responses labeled manually as true that are scored as true will be maximized while minimizing 
the number of false positives, that is, responses labeled manually as false that we do not want to 
wrongly score as true. Once a threshold is selected over a certain item and non-null response 
dataset, then this threshold can be used to score future unseen responses. 
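The supervised selection can be sketched as follows. Two points are assumptions rather than statements from the text: a separating K is taken to mean that every manually false response has a minimum Dissimilarity strictly greater than the largest one among the manually true responses, and the fallback objective (true positives minus false positives, with ties broken toward the smaller threshold) is only one possible way to trade off the two counts. The name `select_threshold` is hypothetical.

```python
from typing import List, Tuple


def select_threshold(minima: List[Tuple[float, bool]]) -> float:
    """Pick a tolerance T from (A_i, scored_true_by_human) pairs of one item."""
    true_vals = [d for d, is_true in minima if is_true]
    false_vals = [d for d, is_true in minima if not is_true]
    if not true_vals:
        raise ValueError("at least one manually true response is needed")

    k = max(true_vals)
    if all(d > k for d in false_vals):
        return k  # K separates the two groups, so T = K

    def gain(t: float) -> int:
        # Assumed objective: true responses accepted minus false responses accepted.
        accepted_true = sum(d <= t for d in true_vals)
        accepted_false = sum(d <= t for d in false_vals)
        return accepted_true - accepted_false

    # Candidates are the observed true minima; max() keeps the first (smallest) on ties.
    return max(sorted(set(true_vals)), key=gain)
```

A new response would then be scored true when its minimum Dissimilarity to the true responses does not exceed the returned value.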
Figure 7. Did human scoring occur? (Flowchart of Process I.) 

Figure 8. Selecting a tolerance for Dissimilarity. (Flowchart of Process II.) 
Clustering Sequences 

Calculating similarity measures allows us to cluster sequences or students’ responses. For 
each item, the similarity measure is calculated between all responses (given in the scoring rules 
or written by students). The matrix that we calculated earlier in Figure 5 will become a square 
m-by-m matrix where m is the number of all available (possible) responses. Because the LCS-based 
measures give the same value regardless of the order of their two arguments, the m-by-m matrix 
obtained is symmetric, with the diagonal elements equal to each other. 

Next, the responses are clustered into equivalence classes depending on their proximity to 
each other. A cluster of responses might be true with high certainty (green or a walking-person 
symbol) because of its proximity to correct or true responses, false with high 
certainty (red or open-arm person symbol) because of its remoteness from a true response, or 
uncertain (yellow or person-with-arms-down symbol) and in need of human intervention. This process 
can be made adaptive or dynamic in the sense that once the representative of the equivalence 
class or the cluster is labeled/scored (whether manually or automatically), a new unseen response 
can either fit into an already existing cluster or create a new cluster of its own (yellow) and the 
process can be repeated. Figure 9 illustrates the idea. 
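The adaptive step (fit into an existing cluster or open a new yellow one) can be sketched as below. The text does not fix a clustering algorithm, so this greedy, representative-based assignment is an assumption; the names `assign` and `tolerance`, the dictionary layout, and the tolerance value used in the note are hypothetical.

```python
from typing import Dict, List


def lcs_length(x: str, y: str) -> int:
    """Length of an LCS of x and y (two-row tabulation)."""
    prev = [0] * (len(y) + 1)
    for gx in x:
        curr = [0]
        for j, gy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if gx == gy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def ndissimilarity(t1: str, t2: str) -> float:
    """([T1] + [T2] - 2 [Similarity]) / ([T1] + [T2]) for two non-null texts."""
    total = len(t1) + len(t2)
    return (total - 2 * lcs_length(t1, t2)) / total


def assign(response: str, clusters: List[Dict], tolerance: float) -> Dict:
    """Add response to the nearest cluster within tolerance, else open a yellow cluster."""
    nearest = min(
        clusters,
        key=lambda c: ndissimilarity(response, c["representative"]),
        default=None,
    )
    if nearest is not None and ndissimilarity(response, nearest["representative"]) <= tolerance:
        nearest["members"].append(response)
        return nearest
    # No cluster is close enough: the response starts a new, unlabeled cluster.
    new_cluster = {"representative": response, "label": "yellow", "members": [response]}
    clusters.append(new_cluster)
    return new_cluster
```

With a single green cluster represented by "Sculpting and music" and a tolerance of 0.2, the response "culpting and music" joins that cluster, while "toolsto remember" opens a new yellow cluster for a human to label.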



Figure 9. Response clustering. (Illustration: clusters of student responses, each shown by one 
representative response and a color label.) 
In the illustration in the figure, there are three clusters of responses, each corresponding 
to an item or a particular language for example, and each cluster has subclusters of green, red, 
and yellow labels. The first subcluster consists of true responses and the second consists of false 
responses. The text that appears in the box is one representative response of the cluster or 
equivalence class. If we were to click on the box, we would get a list of all responses 
corresponding to that cluster. As mentioned above, an unseen response can either fit under any of 
the existing clusters if it is similar to the members of that cluster/equivalence class or create a 
cluster of its own, which by default is labeled as not_labeled (yellow). In that case, a human can 
look at that newly formed cluster and change its label to red or green. 

Such clustering of responses corresponding to one item in one language or across 
languages has a nontrivial impact on content design, representation, and the scoring rules 
associated with this content. In the evaluation section, we will describe its impact and provide 
concrete examples. 

Before we go on to present the related work, implementation, and evaluation sections, we 
need to describe two methods that seem to be, at first glance, good solutions for this specific 
problem of the highlighting items. These methods are (a) difference or commonality in terms of 
number of graphemes and (b) the longest common substring. We argue, however, that though they 
work for some cases, they are not general solutions, while the LCS-based solution is. 

The first method is a naive baseline that compares the length of a non-null unseen response (UR) 
to the length of each true response (TR): 

$$\text{Normalized difference} = \frac{\mathrm{abs}([UR] - [TR])}{[UR] + [TR]}.$$

This comparison of length is not a general solution for the problem, though it seems like 
it works in some cases. Consider the following example, where it seems acceptable to score the 
unseen response as true: 

• Unseen response: Sci discover t world that exists; engineers create the world that 
nev 

• True response: scientists discover the world that exists 

$$\text{Normalized difference} = \frac{\mathrm{abs}(69 - 42)}{69 + 42} = 0.24.$$

However, consider the following example where it is not acceptable to score the unseen 
response as true: 

• Unseen response: fire lookout towers roads water supply systems water drainage 
systems airports bridges sports stadiums toilet blocks marinas towers for wind 
generators elevated viewing platforms in forests or above lakes In short, engineers 
create the sorts of things we use all the time. Civil engineers can show their work 
to others with pride. And when these structures need repair or modification or 
when they've outlasted their usefulness, and need to be removed, these operations 
are the responsibility of civil engineers as well. 

• True response: What’s the difference between a scientist and an engineer? The 
well-known physicist Theodore von Newman once said, “Scientists discover the 
world that exists; engineers create the world that never was.” Or to use a literary 
metaphor, scientists write about the rules of poetry, whereas engineers write the 
poetry itself. 

$$\text{Normalized difference} = \frac{\mathrm{abs}(522 - 321)}{522 + 321} = 0.24.$$

Both have a normalized difference of 0.24. Hence, it is important to know if there is any 
commonality and not just a count. This does not mean that there might not be a false positive 
(i.e., a false response scored wrongly as a true response) using the LCS-based method once a 
threshold is selected, but it is more likely that there will not be as many relative to the naive 
length comparison. 
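The arithmetic behind the two examples can be reproduced with the lengths reported above (69 and 42 graphemes in the first pair, 522 and 321 in the second). The function name is hypothetical; the sketch only restates the naive baseline.

```python
def normalized_length_difference(unseen_length: int, true_length: int) -> float:
    """abs([UR] - [TR]) / ([UR] + [TR]), using grapheme counts only."""
    return abs(unseen_length - true_length) / (unseen_length + true_length)


acceptable_pair = normalized_length_difference(69, 42)      # 27 / 111
unacceptable_pair = normalized_length_difference(522, 321)  # 201 / 843
```

Both values round to 0.24, although the first unseen response overlaps heavily with its true response and the second comes from a different passage, which is why a count alone cannot separate them.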

With the second method, the question is whether we can use the longest common substring, 
rather than the longest common subsequence, as an approximate match measure, that is, 

$$\text{Similarity}(y_{tj}, x_{uj}) = \text{longest common substring of graphemes between } y_{tj} \text{ and } x_{uj}.$$

The difference between a substring and a subsequence is that a substring consists only of 
consecutive characters, whereas a subsequence only requires the characters to appear in the same 
order (the direction of writing of the language). The substring measure will also work in many 
cases, but it will not account for infix omissions. Consider the true response “quick dirty work” 
and the student response “quck drty wok”: the longest common substring is rty wo, of Length 5 
(excluding white spaces), whereas the longest common subsequence is quck drty wok, of Length 11 
(excluding white spaces). On the other hand, accounting for infix omissions might backfire, too, 
in some cases. Consider the true response “quick dirty work” and the student response “life can 
be dandy, wild and honorable.” Then the longest common subsequence is i c d y w o r, of Length 7 
(excluding spaces), whereas the longest common substring has a maximum Length of 2. 

A comparative evaluation of longest common subsequences and longest common substrings 
depends on the type of data under consideration and can only be shown empirically. However, 
intuitively, finding the longest common subsequence (LCS) is a more general and more suitable 
solution for our tasks than finding the longest common substring. 
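
The contrast between the two measures can be made concrete with a short Python sketch. It is not the authors' implementation (the prototype described later was written in Prolog); the function names are hypothetical, and white spaces are removed before comparison to match the "excluding white spaces" counts in the first example.

```python
def lcs_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters in order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1  # extend the run ending at the previous pair
                best = max(best, curr[j])
        prev = curr
    return best


def strip_spaces(text: str) -> str:
    return text.replace(" ", "")


true_response = strip_spaces("quick dirty work")
student_response = strip_spaces("quck drty wok")
print(lcs_subsequence_length(true_response, student_response))           # 11
print(longest_common_substring_length(true_response, student_response))  # 5
```

The subsequence measure tolerates the three omitted graphemes, while the substring measure is cut short at each omission.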

Related Work 

The medical literature has a lot to say about sequences and their comparisons. Sequence 
comparison can be done in two ways: qualitatively, or visually, and quantitatively. One type of 
comparison is known as alignment. It consists of two components: defining a similarity measure 
and defining an algorithm that will find the optimal alignment—in other words, defining a 
scoring scheme, scoring all sequence alignments, and then selecting the alignment with the best 
score. For example, in our case, the similarity measure is defined as the length of the LCSs; we 
assume different heuristic approaches to find the optimal alignment or select a threshold. 

In simple terms, sequence alignment is the most economical way to transform one 
sequence into another. For example, the sequence ABCD is transformed into EBCD by 
substituting A with E, while transforming it into EBD requires an additional deletion of C. 
Hence, assuming all operations cost the same, the cost of aligning ABCD with EBCD is less than 
the cost of aligning it with EBD. The operators do not have to be restricted to insertion and 
deletion (called indels) or to substitution and transposition, and they do not necessarily have 
equal cost. In general, given a set of operators over an alphabet, one can define a cost for 
transforming one sequence into another. 
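
The ABCD example can be reproduced with a small dynamic-programming sketch. The function `edit_cost` and its parameter names are illustrative assumptions; the default unit costs correspond to the "all operations cost the same" case, and transposition is left out for brevity.

```python
def edit_cost(source: str, target: str,
              insert_cost: int = 1, delete_cost: int = 1,
              substitute_cost: int = 1) -> int:
    """Minimum total cost of turning source into target with the given operator costs."""
    rows, cols = len(source) + 1, len(target) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost   # delete every character of source
    for j in range(1, cols):
        cost[0][j] = j * insert_cost   # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
                cost[i - 1][j - 1] + (0 if same else substitute_cost),
            )
    return cost[-1][-1]


print(edit_cost("ABCD", "EBCD"))  # 1: substitute A with E
print(edit_cost("ABCD", "EBD"))   # 2: substitute A with E, delete C
```

Changing the cost arguments changes which alignment is cheapest; for instance, with `substitute_cost=3` the substitution of A with E is replaced by a deletion plus an insertion.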

There are different types of sequence alignment: global, local, and multiple sequence 
alignment. Global alignment is the best alignment over the entire length of the two sequences. It 
usually starts at the beginning of the two sequences and adds gaps to each until the end of one is 
reached. Local alignment refers to considering alignment over subsequences and not the entire 
length of the two sequences in question. It finds the regions of highest similarity between the two 
sequences and builds the alignment outward from there. Multiple sequence alignment involves 
more than two sequences. We will not go into it in this particular study. Table 4 lists some of the 
well-known similarity measures, scoring schemes, and the algorithm(s), local or global, used 
to find the optimal alignment. 

A brute force approach to find optimal alignment would be to generate all alignments, 
score them, and select the best score—an approach that is very impractical. The Needleman- 
Wunsch approach (Needleman & Wunsch, 1970) reduces the number of possibilities 
considerably yet guarantees that the best solution is still obtained. The basic idea is to build up 
the best alignment by using optimal alignments of smaller subsequences. Sellers (1974) used a 
metric distance to define similarity and select the best score. This led to the Smith-Waterman 
algorithm (Smith & Waterman, 1981; Arratia & Waterman, 1994), the most accurate in database 
search but the slowest. Since then there have been many variations for optimizations and more 
efficient algorithms. For example, Basic Local Alignment Search Tool (BLAST; Altschul, Gish, 
Miller, Myers, & Lipman, 1990; Altschul et al., 1997) is the tool most frequently used for 
calculating sequence similarity. BLAST was developed to provide a faster technique than 
another algorithm developed earlier, called FAST-All (FASTA; Pearson & Lipman, 1988). 
FASTA, which works with any alphabet, is an extension of two other tools, FAST-P (for protein) 
and FAST-N (for nucleotide). 
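
The idea of building the best global alignment from optimal alignments of smaller prefixes can be sketched as follows. This is a minimal, score-only version in the spirit of Needleman-Wunsch; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not values taken from this study.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b, computed prefix by prefix."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix needs i gaps
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ABCD", "EBCD"))  # 2: three matches, one mismatch
print(global_alignment_score("ABCD", "EBD"))   # 0: two matches, one mismatch, one gap
```

Each cell is filled from three already-solved smaller problems, so the number of evaluated alignments grows with the product of the sequence lengths rather than exponentially.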

The edit distance (Levenshtein, 1965, 1966) is the minimum number of deletions, insertions, 
or substitutions required to transform a string S into a string T. Several variations of this 
metric exist; for example, the Damerau-Levenshtein distance (Damerau, 1964) also accounts for 
transpositions of two adjacent characters. The cost of each operation (insertion, deletion, 
transposition, substitution) might vary. 

One can easily see that any of these sequence alignment techniques can be seen as a 
variation of looking at our problem. In particular, when only insertion and deletion (no 
substitution) are allowed, or when the cost of a substitution is double the cost of an insertion or 
deletion, then for two sequences $S = s_1, s_2, \ldots, s_n$ and $T = t_1, t_2, \ldots, t_m$ the 
edit distance is 

$$d(S, T) = n + m - 2\,\mathrm{LCS}(S, T),$$

where LCS(S, T) denotes the length of the LCS of S and T. Hence, one can easily make a 
comparative evaluation with the edit distance measure under these conditions. However, the 
reason the edit distance measure was not selected first for this particular application is that it 
is not always commutative. It is only commutative when the cost of each operation is the same. 
We wanted to start with a measure that will partition the space of responses into equivalence 
classes without additional conditions. 
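
The relation between the insertion/deletion-only edit distance and the LCS length can be checked numerically. The sketch below is illustrative only; `indel_distance` and `lcs_length` are hypothetical helper names, and unit costs are assumed for insertion and deletion.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for ch_s in s:
        curr = [0]
        for j, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(s: str, t: str) -> int:
    """Edit distance when only unit-cost insertions and deletions are allowed."""
    prev = list(range(len(t) + 1))
    for i, ch_s in enumerate(s, start=1):
        curr = [i]
        for j, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                curr.append(prev[j - 1])  # matching characters cost nothing
            else:
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


s, t = "quick dirty work", "quck drty wok"
print(indel_distance(s, t))                     # 3
print(len(s) + len(t) - 2 * lcs_length(s, t))   # 3
```

With equal insertion and deletion costs the distance is also the same in both directions, which is the commutativity condition mentioned above.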
Table 4 

Some Well-Known Alignment Techniques 

| Finding optimal alignment | Type | Tool | Comments |
|---|---|---|---|
| Needleman-Wunsch (1970) | Global alignment | | Same accuracy as Smith-Waterman. Slower than BLAST but faster than Smith-Waterman |
| Smith-Waterman (1981) or better implementations of it (Gotoh, 1982; Altschul & Erickson, 1986) | Local sequence alignment | Used in FASTA | Slower than BLAST but more accurate (exhaustive and uses dynamic programming). It is a variation of the Needleman-Wunsch. |
| BLAST: Heuristic approach approximating the Smith-Waterman algorithm (Altschul et al., 1990) | Local | BLAST | Faster than Smith-Waterman but less accurate |
| Align | Global | FASTA | Implements the Needleman-Wunsch global alignment algorithm |
| LAlign | Local: Huang and Miller (1991) | FASTA | Implements the Waterman local-alignment algorithm. FASTA uses a hybrid of heuristic and exhaustive approaches. |
| Needle | Global | EMBOSS | Same program as Align, but it has an improved version called Stretcher (Myers & Miller, 1989) |
| Edit-distance measure | Global | Several tools | |
| LCS | Can be used both global and local | Several tools including ours | |


Sequence alignment is used beyond biology and bioinformatics. In fact, some of these 
computer science techniques might have been originally developed for text editing and text 
comparison but became more popular with biological applications. For example, file comparison 
programs such as the function diff in Unix, which compares pairs of lines belonging to each file 
in sequential order, use the LCS algorithm. The diff program was originally written by 
D. McIlroy and J. W. Hunt. Their implementation was for an algorithm originally published by 
Hunt and Szymanski (1977). 
In linguistics, sequence alignment has been used in various areas. For instance, in natural 
language generation, sequence alignment techniques were used to produce linguistic versions of 
computer-generated mathematical proofs (Barzilay & Lee, 2002). Spell-checking techniques 
could depend on sequence alignment (Sasu, 2011), and similarly for speech recognition (Pucher, 
Turk, Ajmera, & Fecher, 2007; Ziolko, Galka, Skurzok, & Jadczyk, 2010). In fact, many 
sequence algorithms had extensive use and some success in speech recognition since the early 
1990s. Another application has been plagiarism detection (Lukashenko, Graudina, & 
Grundspenkis, 2007; Su et al., 2008). In comparative linguistics, sequence alignment was used to 
compare or reconstruct languages automatically (Kondrak, 2002). 

Furthermore, alignment techniques were used in business applications—for example, to 
analyze purchases over time (Prinzie & Van den Poel, 2005). 

This Study 

Description 

For this particular study, we did not have access to the minimum evidence, the maximum 
evidence, or the scoring rules, only to the automatic scores obtained using strict symbolic-logic rules. 
Also, beyond the minimum, missing characters within the maximum were tolerated in the 
existing implementation of the scoring rules. For instance, a response, “Hw birth ra less than 
average fell 4% between 2007-2009 low due to unfavorable economic conditions,” for the item 
in Table 1 is scored automatically as true while “Hw birth ra less than averag fell 4% between 
2007-2009 low due to unfavorable economic conditions” is scored as false. Hence, the scoring 
task was reduced to the following: For each item, given the responses and their scores, where 
scores have been obtained automatically using the propositional logical rules, the aim is to 
correct the scores of responses scored as false due to insertion or omission of characters. 

The processes and methodologies were applied as described earlier in the method section. 
The set of non-null test takers’ responses that were scored automatically as false were considered 
instead of the set of unseen responses, ResUnseen, and the set of distinct responses that were 
scored automatically as true were considered ResTrue. 

The clustering task was reduced to the available responses with no access to true 
responses except the ones scored automatically as true. In both tasks, the assumption was that 
responses scored automatically as true are scored correctly. 
Implementation 

This is an intuitively simple problem, but surprisingly difficult to solve and implement 
correctly. Methods to improve performance efficiency in terms of time and memory can make a 
difference for implementation, as the literature suggests. In the following, we provide a 
complexity analysis for each item that can be easily generalized to any number of items. 

The preprocessing step includes (a) finding all responses that were labeled automatically 
as false, (b) finding all responses that were labeled automatically as true (and any true responses 
if we have access to the minimum and maximum), (c) filtering out all null responses labeled as 
false, and (d) considering distinct true responses, and then transforming them to a uniform 
representation. Hence, same or uniformly represented responses are identified and only one is 
considered in the processing. Also, we applied some quality assurance steps, such as no null 
responses being scored as true, and each candidate has one and only one response (the latest he 
or she produced). 
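
The preprocessing steps can be sketched in Python as follows. This is a hedged illustration rather than the original code: `uniform` and `group_by_uniform` are hypothetical names, the uniform representation is taken to mean "no punctuation and no spaces" as described in the Results section, and the use of Unicode general categories and NFC normalization to detect punctuation and unify graphemes is our assumption.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List


def uniform(text: str) -> str:
    """Drop white space and punctuation (assumed: Unicode categories Z* and P*)."""
    normalized = unicodedata.normalize("NFC", text)  # assumption: compose graphemes first
    return "".join(
        ch for ch in normalized
        if not ch.isspace() and not unicodedata.category(ch).startswith(("P", "Z"))
    )


def group_by_uniform(responses: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct non-null uniform response to the candidate IDs that produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for candidate_id, text in responses.items():
        key = uniform(text)
        if key:  # filter out null responses
            groups[key].append(candidate_id)
    return dict(groups)


# Hypothetical candidate IDs and responses, for illustration only.
false_responses = {
    "c1": "quick dirty work",
    "c2": "quick, dirty work.",
    "c3": "   ",
}
print(group_by_uniform(false_responses))
# {'quickdirtywork': ['c1', 'c2']}
```

Keeping the candidate IDs alongside each uniform response means that only one representative per group needs to be compared, while the updated score can still be written back to every candidate in the group.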

Transforming the non-null responses labeled as false to a uniform representation is 
performed similarly to true responses for consistency and efficiency, keeping in mind that 
candidate IDs for responses labeled as false have to be possible to retrieve. Also, the results for 
each pair of a true response ($y_{tj}$) and a false response ($x_{fj}$), namely <Text1/$x_{fj}$, 
Text2/$y_{tj}$, LCS, Length, Dissimilarity, NDissimilarity>, can be stored in a database in order 
to avoid recalculating for the same pair of sequences or texts. If Text1 and Text2 have already 
been encountered, we just look them up. 
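
The look-up idea can be illustrated with an in-memory cache standing in for the database. This is a sketch under that assumption: the standard-library `functools.lru_cache` decorator stores the result for each (Text1, Text2) pair, and only the LCS length is cached here, whereas the stored record described above also holds the dissimilarity values.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_lcs_length(text1: str, text2: str) -> int:
    """LCS length of two texts; repeated calls with the same pair are looked up."""
    prev = [0] * (len(text2) + 1)
    for ch1 in text1:
        curr = [0]
        for j, ch2 in enumerate(text2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


cached_lcs_length("quckdrtywok", "quickdirtywork")   # computed
cached_lcs_length("quckdrtywok", "quickdirtywork")   # looked up
print(cached_lcs_length.cache_info().hits)           # 1
```

A persistent store would be needed to reuse results across runs; the in-memory cache only avoids recomputation within a single process.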

Hence, in practice, one would be dealing with M distinct uniformly represented non-null 
responses labeled as false and N distinct uniformly represented true responses. 

All of the above preprocessing steps are either constant or linear in the number of 
responses labeled as true or false. Hence, this preprocessing step will be ignored as part of the 
analysis of runtime. Also, once the Mx N matrix is produced, calculating A; and selecting the 
threshold also takes a constant time. The runtime of calculating the matrix, in the worst case 
scenario, is: 


$$F(Z) = M \times N \times O(\mathrm{LCS}(T_i, T_j)), \quad \text{where } i = 1, 2, \ldots, M \text{ and } j = 1, 2, \ldots, N.$$


Hence, the process of comparing two texts, that is, finding the LCSs, is what is crucial to 
analyze. This is a classical computer science problem and has been studied extensively. Many 
suggested implementations with various complexities exist in the literature (Hadlock, 1988; 
Hirschberg, 1977; Hunt & Szymanski, 1977; Masek & Paterson, 1980; Myers, 1986; Nakatsu, 
Kambayashi, & Yajima, 1982; Wagner & Fischer, 1974). 

As seen in Table 4, the algorithm is recursive. However, if it were implemented purely 
recursively, by calculating all subsequences and selecting one, then its runtime in the worst-case 
scenario would be exponential in the number of characters in the two texts. If implemented in a 
dynamic way or using memoization (Wagner & Fischer, 1974), then it has an $O(n \times m)$ 
worst-case running time, where n and m are the number of graphemes in each sequence. Its space 
requirement in the worst case is quadratic, too. Hirschberg’s algorithm (1975) reduced the space 
requirement to $O(n + m)$. Later, an improvement to $O\left(\frac{n \times m}{\log n} + m\right)$ 
was suggested by Masek and Paterson (1980). 
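
The difference between a memoized recursive formulation and a space-saving bottom-up one can be sketched as follows. Both functions are illustrative and return only the LCS length; the two-row version uses space proportional to the shorter sequence, which is the idea behind the linear-space bound (Hirschberg's algorithm additionally recovers the subsequence itself, which this sketch does not attempt).

```python
from functools import lru_cache


def lcs_memoized(a: str, b: str) -> int:
    """Top-down recursion with memoization: O(n*m) time and O(n*m) space."""
    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + solve(i + 1, j + 1)
        return max(solve(i + 1, j), solve(i, j + 1))

    # Note: very long inputs may exceed Python's default recursion limit.
    return solve(0, 0)


def lcs_two_rows(a: str, b: str) -> int:
    """Bottom-up computation keeping two rows: O(n*m) time, O(min(n, m)) space."""
    if len(b) > len(a):
        a, b = b, a  # keep the shorter sequence along the row
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_memoized("quickdirtywork", "quckdrtywok"))  # 11
print(lcs_two_rows("quickdirtywork", "quckdrtywok"))  # 11
```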

Other existing algorithms present complexities that depend on parameters other than n. 
For example, Myers (1986) and Nakatsu et al. (1982) suggested an algorithm with $O((n + m)D)$, 
where D is the simple Levenshtein distance between two given strings. Iliopoulos and Rahman 
(2008) suggested an algorithm with $O((n + m)R)$, where R is the total number of ordered pairs of 
positions at which the two sequences match. Thang (2011) suggested that if the two 
sequences belonged to two finite languages accepted by two finite automata, $A_1$ and $A_2$, then 
the algorithm for finding the LCS has an $O(|A_1| |A_2|)$ worst-case running time, where $|A_i|$ 
is the number of states and edges of automaton $A_i$. 

In our case, for the first prototype, we implemented a dynamic recursive algorithm in 
Sociaal-Wetenschappelijke Informatica (SWI) Prolog with a quadratic time and quadratic space 
requirement. Graphemes were transformed to a Unicode (UTF-8) representation to make the 
implementation natural language-independent. 

In an operational setting, we need only one true response with Dissimilarity less than or 
equal to the threshold for a false response to switch to a true response. 
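
The operational rule can be sketched as a function that switches a response to true as soon as one true response is close enough. The dissimilarity measures themselves are defined in the method section and are not reproduced here, so the sketch takes the measure as a parameter; the default `assumed_ndissimilarity` (one minus twice the LCS length over the summed lengths) is purely a placeholder assumption and should not be read as the definition used in this study. The threshold 0.19 is used only as an example value.

```python
from typing import Callable, Iterable


def lcs_length(a: str, b: str) -> int:
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def assumed_ndissimilarity(a: str, b: str) -> float:
    """Placeholder measure (an assumption): 1 - 2*LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * lcs_length(a, b) / total


def rescore_as_true(false_response: str,
                    true_responses: Iterable[str],
                    threshold: float,
                    measure: Callable[[str, str], float] = assumed_ndissimilarity) -> bool:
    """True if at least one true response lies within the threshold."""
    return any(measure(false_response, t) <= threshold for t in true_responses)


true_set = ["quickdirtywork"]
print(rescore_as_true("quckdrtywok", true_set, threshold=0.19))                     # True
print(rescore_as_true("lifecanbedandywildandhonorable", true_set, threshold=0.19))  # False
```

Because `any` stops at the first true response within the threshold, the remaining comparisons for that false response are skipped.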

As mentioned earlier, for efficiency, the results for each pair (Text1/$x_{fj}$, Text2/$y_{tj}$, 
LCS, Length of LCS, Dissimilarity, and NDissimilarity) have been stored in a database in order 
to avoid recalculating for the same pair of sequences or texts. If Text1 and Text2 already have 
been encountered, we just look them up. 
Evaluation 


For the evaluation of the approach and techniques, four highlighting items belonging to 
the item type previously described, written in English (EN), Spanish (ES), French (FR), and 
Japanese (JA), were considered. The language codes were adopted from Wikipedia (“List of ISO 
639-1 codes,” n.d.). All responses in all languages were scored manually. All English, French, 
and Spanish responses were scored by the same human rater. All the Japanese responses were 
scored by one human rater, who was different from the person who scored the English, French, 
and Spanish. The responses were collected in Chile (CL), Canada (CA), Britain (GB), France 
(FR), Spain (ES), Ireland (IE), and Japan (JP). The country codes were adopted from Wikipedia 
(“ISO 3166-2,” n.d.). 


Results 

We first include some average sequence lengths for different languages in each item, 
which illustrate that the sequences considered in this task are far shorter than the sequences 
considered in DNA sequence algorithms in general. They also give an idea of the range of true 
student responses. Figure 10 summarizes the range and the average number of graphemes of 
non-null false and true responses. In general, on average, the number of graphemes for responses 
scored automatically as false is larger than that of responses scored as true. 
Figure 10. Average number of graphemes in the 32 <country, language> considered. (Legend: 
diamond markers, average # of graphemes in true responses; line, average # of graphemes in false 
responses.) 


In the following, we will present the results of our scoring approach for each item 
separately. 
Because there is very little data, we decided to train on responses associated with one 
country only, regardless of the size of the dataset associated with that country and its language. 
Once a threshold is selected based on that country, the evaluation is performed using the datasets 
of responses corresponding to the rest of the countries even if the datasets belonged to other 
languages. This assumes that the rubrics for each item in all countries are defined in the same 
way. There was no specific methodology for selecting the set of training data except the order of 
the files listed. In other words, there is no intuitive reason for favoring one country or language 
over the other. Hence, the methodology could be modified when more data are available. For 
instance, the experiment could be repeated by dividing the number of responses equally or by 
treating each language separately because some issues might be language specific. 

In the rest of the document we will refer to automatic scores based on the symbolic 
logical rules given with each item as original automatic scores and the scores based on our LCS 
method as updated automatic scores. Recall that there are several assumptions. First, there is no 
access to the scoring rules. Second, the original true responses are correct. In fact, there is 
absolutely no reason why any of them should be incorrect because the existing Boolean 
implementation was verified. The aim is to rescore the set of non-null responses with original 
scores false. 

Tables 5-16 consist of three tables of results for each of the four items. The first table for 
an item contains (a) the total number of student responses, (b) the number of responses with 
original automatic scores that are false, (c) the number of non-null responses with original scores 
that are false, (d) the number of non-null distinct responses with original scores that are false, (e) 
the number of non-null distinct uniform responses with original scores that are false, (f) the 
number of responses with original automatic scores that are true, (g) the number of distinct 
responses with original scores that are true, and (h) the number of distinct responses represented 
in a uniform way with original scores that are true. Recall that a uniform representation is one 
with no punctuation and no spaces. Some languages, such as Japanese, have fewer punctuation 
marks than other languages and graphemes are not separated with white spaces; hence, a uniform 
representation might not make much difference. 

The second table for an item contains (a) the number of non-null responses with a human 
score of false with an original automatic score that is false, (b) the number of non-null false 
responses labeled by humans as true, (c) the number of responses scored by humans as false that 
are corrected properly to true using our scoring method, and (d) the number of responses scored 
by humans as false but wrongly scored by our method as true. 

The third table summarizes the results for each item in terms of accuracy, precision, 
recall, and true negative rate. The meaning of precision = 1 and recall = 1 is that the method used 
has yielded the whole truth and nothing but the truth. In our case, precision = 1 means that the 
new automatic scoring method is definitely an improvement over the existing technique, that is, 
the implementation of the Boolean scoring rules. 

Results for Item 1 

The following three tables of results correspond to Item 1. 


Table 5 


Item 1: False Versus True Responses as Originally Scored Automatically 



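| Dataset | Total | False | Non-null false | False distinct non-null | False distinct non-null uniform | True | True distinct | True distinct uniform |
|---|---|---|---|---|---|---|---|---|
| IE_EN | 85 | 52 | 42 | 32 | 30 | 33 | 14 | 5 |
| GB_EN | 69 | 53 | 38 | 26 | 26 | 16 | 11 | 6 |
| CA_EN | 48 | 30 | 19 | 17 | 17 | 18 | 9 | 6 |
| CL_ES | 94 | 70 | 44 | 34 | 32 | 24 | 9 | 4 |
| ES_ES | 127 | 90 | 58 | 40 | 40 | 37 | 16 | 10 |
| FR_FR | 244 | 164 | 108 | 85 | 82 | 80 | 29 | 27 |
| CA_FR | 66 | 41 | 29 | 19 | 19 | 25 | 13 | 11 |
| JP_JA | 153 | 75 | 50 | 43 | 43 | 78 | 23 | 23 |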


Training on IE_EN, a threshold of 0.19 is learned based on the responses labeled by 
humans. Table 6 shows the results for the evaluation using the datasets corresponding to other 
countries and languages with the threshold of 0.19. In GB_EN, five of six are scored correctly to 
true with the threshold selected. One response has a minimum Dissimilarity of 0.21 with a true 
response; hence, it is missed being scored correctly. CA_EN has zero true responses labeled by 
humans. In ES_ES, the three responses get corrected. Actually, it turns out they only need a 
threshold of 0.07. In FR_FR, nine of the 12 pass, but three responses need minimum 
dissimilarities of 0.2, 0.22, and 0.25, respectively, to be corrected. The eight responses in CA_FR 
pass and get scored correctly. In fact, a threshold of 0.16 turns out to be enough for these 
responses. For JP_JA, only three of the 16 responses pass the threshold, while each of the rest 
has a minimum Dissimilarity with a true response greater than 0.19. 
Table 6 

Item 1: How Many False Get Corrected Properly to True? 

| Dataset | Non-null with original automatic score as false | Non-null original score false scored by humans as true | Non-null original score false with human score true with updated automatic score as true | Non-null original score false with human score false with updated automatic score as true |
|---|---|---|---|---|
| IE_EN (training) | 42 | 3 | 3 | 0 |
| GB_EN | 38 | 6 | 5 | 0 |
| CA_EN | 19 | 0 | 0 | 0 |
| CL_ES | 44 | 1 | 1 | 0 |
| ES_ES | 58 | 4 | 4 | 0 |
| FR_FR | 108 | 12 | 9 | 0 |
| CA_FR | 29 | 8 | 8 | 0 |
| JP_JA | 50 | 16 | 3 | 0 |


One of the false responses in CL_ES that was labeled by humans as false and came close 
to passing the threshold (0.199) was the following. The student highlighted the true response, 
but some additional information was also highlighted that, though it kept the response in the 
proximity of true, was part of a quote from someone other than the person in the prompt of the 
item: 


s. El entrenamiento especifico mejora la atencion dividida,” senala Sekuler...“Una 
sesion de entrenamiento de tan solo dos horas mejora los resultados de las pruebas de las 
personas mayores. El entrenamiento sostenido puede hacer maravillas” 


This is one case where the context, or more simply the position of the graphemes in the 
passages, can be used to help. Table 7 summarizes the results. 


Table 7 

Item 1: How Many False Get Corrected Properly to True? (Summary) 


| Item | Training # of positive instances (# of negative instances) | Evaluation | Accuracy | Precision | Recall | True negative rate |
|---|---|---|---|---|---|---|
| Item 1 | 3 (39) | 346 | 0.95 | 1 | 0.63 | 1 |
| Item 1^a | | | 0.99 | 1 | 0.96 | 1 |

^a Accuracy measures recalculated without Japanese responses. 

The positive instances are the responses scored by humans as true and the negative 
instances are the ones scored by humans as false. The second row shows the results if we were to 
recalculate the accuracy measures above without considering the Japanese responses. 
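
As a worked check, the four measures can be recomputed from the Item 1 evaluation counts in Table 6 (all datasets except the training set IE_EN). The sketch below is illustrative; the function and variable names are ours. It gives an accuracy of 329/346, about 0.95, a precision and true negative rate of 1, and a recall of 30/47, about 0.64, which Table 7 lists as 0.63.

```python
from typing import Dict


def scoring_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and true negative rate from confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "true_negative_rate": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Item 1 evaluation rows of Table 6: GB_EN, CA_EN, CL_ES, ES_ES, FR_FR, CA_FR, JP_JA.
non_null_false = [38, 19, 44, 58, 108, 29, 50]
human_true = [6, 0, 1, 4, 12, 8, 16]
corrected_to_true = [5, 0, 1, 4, 9, 8, 3]
wrongly_true = [0, 0, 0, 0, 0, 0, 0]

tp = sum(corrected_to_true)                       # human true, updated score true
fn = sum(human_true) - tp                         # human true, still scored false
fp = sum(wrongly_true)                            # human false, updated score true
tn = sum(non_null_false) - sum(human_true) - fp   # human false, still scored false

for name, value in scoring_metrics(tp, fp, fn, tn).items():
    print(f"{name}: {value:.3f}")
```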


Results for Item 2 

The following three tables of results correspond to Item 2. 


Table 8 


Item 2: False Versus True as Originally Automatically Scored 



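| Dataset | Total | False | Non-null false | False distinct non-null | False distinct non-null uniform | True | True distinct | True distinct uniform |
|---|---|---|---|---|---|---|---|---|
| IE_EN | 81 | 23 | 17 | 13 | 13 | 58 | 3 | 2 |
| GB_EN | 66 | 16 | 5 | 4 | 4 | 50 | 5 | 5 |
| CA_EN | 44 | 13 | 5 | 5 | 5 | 31 | 4 | 4 |
| CL_ES | 107 | 34 | 21 | 19 | 18 | 73 | 7 | 7 |
| ES_ES | 128 | 44 | 20 | 14 | 14 | 84 | 7 | 7 |
| FR_FR | 213 | 73 | 27 | 16 | 16 | 140 | 12 | 11 |
| CA_FR | 56 | 23 | 10 | 10 | 10 | 33 | 4 | 4 |
| JP_JA | 157 | 36 | 16 | 15 | 15 | 121 | 13 | 13 |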


Training on IE_EN, a threshold of 0.69 is selected, with one false response with a human 
label of false getting an updated score as true (with a 0.5 minimum NDissimilarity and a 
response of 5). To evaluate, the rest of the countries were used and results are listed in Table 10. 
In GB_EN, the two responses are corrected properly; after inspection, a 0.33 threshold is enough 
for the two responses. The response in CA_EN needs an NDissimilarity of 0.7. Hence, it gets 
missed with a 0.69 threshold. For ES_ES, the five responses pass the threshold. In fact, all 
responses require NDissimilarity of 0.2-0.55. Similarly, CL_ES’s five responses pass with 
maximum NDissimilarity of 0.33. In particular, 0.13, 0.2, 0.27, 0.33 is the minimum proximity of 
each response to one of the true responses. For FR_FR, the eight responses pass with 0.06-0.56 
NDissimilarity range. For CA_FR, two responses fail to pass the threshold requiring 0.74 and 
0.75 NDissimilarity, while one response passes with 0.22 NDissimilarity. For JP_JA, five out of 
the five pass (0.01-0.17). 
Table 9 

Item 2: How Many False Get Corrected Properly to True? 

| Dataset | Non-null with original automatic score as false | Non-null original score false scored by humans as true | Non-null original score false with human score true with updated automatic score as true | Non-null original score false with human score false with updated automatic score as true |
|---|---|---|---|---|
| IE_EN (training) | 17 | 9 | 9 | 1 |
| GB_EN | 5 | 2 | 2 | 1 |
| CA_EN | 5 | 1 | 0 | 1 |
| CL_ES | 21 | 5 | 5 | 11 |
| ES_ES | 20 | 6 | 6 | 4 |
| FR_FR | 27 | 8 | 8 | 2 |
| CA_FR | 10 | 3 | 1 | 2 |
| JP_JA | 16 | 8 | 8 | 0 |


For this particular item, the majority of correct responses are the same. There is not much 
variety in correct responses with which to compare. Hence, a similar response to the following 
true response is deemed too distant, unfortunately: 

When you make a direct call to Portugal from another country, after dialing the number 
which gains access to the international service (which varies from country to country), 
you should dial 351 (the code for Portugal) and the inter-urban code without 

On the other hand, a response in GB_EN whose original score is false and whose human 
score is false such as “For information about the codes for countries and places that are not 
included in the list, please dial: For help connecting a call, please dial: Some Country, Regional 
and City Codes In the case of countries marked with a you only ne” gets a minimum 
Dissimilarity score of 0.673 with one true response, “351 (the code for Portugal) and the 
inter-urban code without the first 0.” 

In CA_EN, a response whose original score is false and whose human score is false such 
as “For information about the codes for countries and places that are not included in the list, 
please dial: 351” has a minimum NDissimilarity of 0.668 with one of the true responses, “you 
should dial 351 (the code for Portugal).” Hence, it passes the threshold of 0.69. 

This latter example requires some knowledge about the context in the rest of the false 
response. We might want to consider the proximity or distance from more than one true response 
and/or other false responses, as we suggest in the clustering approach where all responses get 
compared. 
For CL_ES, a minimum NDissimilarity is shared between two responses, 35 and 00 49 
351. The original automatic scores for these two responses are false. They share the 
same NDissimilarity of 0.43 when compared to a correct response, 1351, while one of them is 
labeled true by humans and the other is labeled false. 

For ES_ES, five responses that are supposed to be true all pass and their minimum 
NDissimilarity is either 0.2, 0.25, or 0.53, while the six false positives have NDissimilarity of 
0.53, 0.55, 0.59, 0.67, 0.68. 

For FR_FR, the two false positives pass with 0.59 and 0.57 NDissimilarities, while for 
CA_FR, the false positives pass with 0.54 and 0.62 NDissimilarities. 

Table 10 summarizes the results. 

Table 10 


Item 2: How Many False Get Corrected Properly to True? (Summary) 


| Item | Training # of positive instances (# of negative instances) | Evaluation | Accuracy | Precision | Recall | True negative rate |
|---|---|---|---|---|---|---|
| Item 2 | 9 (8) | 104 | 0.76 | 0.57 | 0.9 | 0.69 |
| Item 2^a | | | 0.76 | 0.57 | 0.9 | 0.69 |

^a Accuracy measures recalculated without Japanese responses. There would be no difference in 
the result if we were to exclude the Japanese responses. 


Results for Item 3 

The following three tables of results correspond to Item 3. 


Table 11 


Item 3: False Versus True as Originally Automatically Scored 



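| Dataset | Total | False | Non-null false | False distinct non-null | False distinct non-null uniform | True | True distinct | True distinct uniform |
|---|---|---|---|---|---|---|---|---|
| IE_EN | 81 | 20 | 11 | 10 | 10 | 61 | 1 | 1 |
| GB_EN | 63 | 17 | 8 | 7 | 7 | 46 | 4 | 4 |
| CA_EN | 46 | 13 | 4 | 4 | 4 | 33 | 3 | 3 |
| CL_ES | 101 | 38 | 9 | 8 | 8 | 63 | 2 | 2 |
| ES_ES | 115 | 42 | 12 | 9 | 9 | 73 | 3 | 3 |
| FR_FR | 216 | 71 | 19 | 15 | 15 | 145 | 5 | 5 |
| CA_FR | 39 | 11 | 1 | 1 | 1 | 28 | 3 | 3 |
| JP_JA | 148 | 24 | 6 | 6 | 6 | 124 | 4 | 4 |
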
For this item, except for Japanese where responses needed NDissimilarity of 0.6 and 0.2 
to pass, most responses were corrected with the selected threshold of 0.09, based on the training 
set in IE_EN. For CL_ES, one response, “Agencia E ad y la Salud,” that says something like “E 
Agency y and Health” for “European Agency for Safety and Health at Work” requires a 0.4 
NDissimilarity to pass. For ES_ES, six out of eight responses pass with the 0.09 threshold. The 
other two require NDissimilarity of 0.17 and 0.13, respectively. For FR_FR, 10 out of the 12 
pass, while the two remaining responses require 0.1 and 0.17 NDissimilarity. 

Table 12 


Item 3: How Many False Get Corrected Properly to True? 



| Dataset | Non-null with original automatic score as false | Non-null original score false scored by humans as true | Non-null original score false with human score true with updated automatic score as true | Non-null original score false with human score false with updated automatic score as true |
|---|---|---|---|---|
| IE_EN (training) | 11 | 8 | 8 | 0 |
| GB_EN | 8 | 6 | 6 | 0 |
| CA_EN | 4 | 3 | 3 | 0 |
| CL_ES | 9 | 5 | 4 | 0 |
| ES_ES | 12 | 8 | 6 | 0 |
| FR_FR | 19 | 12 | 10 | 0 |
| CA_FR | 1 | 1 | 1 | 0 |
| JP_JA | 6 | 2 | 0 | 0 |


Table 13 summarizes the results. 

Table 13 


Item 3: How Many False Get Corrected Properly to True? (Summary) 


| Item | Training # of positive instances (# of negative instances) | Evaluation | Accuracy | Precision | Recall | True negative rate |
|---|---|---|---|---|---|---|
| Item 3 | 8 (3) | 60 | 0.88 | 1 | 0.81 | 1 |
| Item 3^a | | | 0.94 | 1 | 0.86 | 1 |

^a Accuracy measures recalculated without Japanese responses. 


Results for Item 4 

The following three tables of results correspond to Item 4. 
Table 14 

Item 4: False Versus True as Originally Automatically Scored 

| Dataset | Total | False | Non-null false | False distinct non-null | False distinct non-null uniform | True | True distinct | True distinct uniform |
|---|---|---|---|---|---|---|---|---|
| IE_EN | 85 | 47 | 47 | 41 | 39 | 38 | 18 | 17 |
| GB_EN | 69 | 43 | 31 | 22 | 21 | 26 | 11 | 10 |
| CA_EN | 48 | 31 | 24 | 20 | 18 | 17 | 11 | 10 |
| CL_ES | 94 | 64 | 43 | 30 | 28 | 30 | 17 | 15 |
| ES_ES | 127 | 72 | 46 | 28 | 26 | 55 | 23 | 21 |
| FR_FR | 246 | 157 | 117 | 84 | 79 | 89 | 29 | 25 |
| CA_FR | 66 | 37 | 29 | 22 | 21 | 29 | 16 | 15 |
| JP_JA | 153 | 84 | 63 | 41 | 41 | 69 | 25 | 25 |


Because IE_EN had no human-labeled true responses, we opted to use the GB_EN dataset for
training, and a threshold of 0.22 was selected. As mentioned earlier, no particular methodology
was used to select the training dataset in this study. When there were no useful responses
corresponding to a particular <country, language> pair, the next available dataset corresponding
to any <country, language> pair was used to learn a threshold.
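
The fallback rule just described can be sketched as follows. The counts of human-labeled true responses come from Table 15; the function name `pick_training_dataset`, the reading of "useful" as "at least one human-labeled true response," and the assumption that datasets are tried in the order in which the tables list them are ours, since the study states that no particular methodology was used.

```python
from typing import Dict, Optional

# Human-labeled true responses among the non-null responses originally scored
# false, per dataset, in the order the tables list them (Table 15, Item 4).
HUMAN_TRUE_ITEM4: Dict[str, int] = {
    "IE_EN": 0,
    "GB_EN": 2,
    "CA_EN": 2,
    "CL_ES": 18,
    "ES_ES": 0,
    "FR_FR": 11,
    "CA_FR": 3,
    "JP_JA": 0,
}


def pick_training_dataset(human_true_counts: Dict[str, int]) -> Optional[str]:
    """Return the first dataset, in listing order, with a human-true response."""
    for dataset, count in human_true_counts.items():
        if count > 0:
            return dataset
    return None  # no dataset offers a positive example to learn from


print(pick_training_dataset(HUMAN_TRUE_ITEM4))  # GB_EN
```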

Table 15

Item 4: How Many False Get Corrected Properly to True?

                   Non-null with       Non-null original   Non-null original    Non-null original
                   original            score false         score false with     score false with
                   automatic score     scored by humans    human score true,    human score false,
                   as false            as true             updated automatic    with updated automatic
                                                           score as true        score as true

IE EN              47                  0                   0                    0
GB EN (training)   31                  2                   2                    0
CA EN              24                  2                   2                    0
CL ES              43                  18                  3                    0
ES ES              46                  0                   0                    0
FR FR              117                 11                  11                   0
CA FR              29                  3                   3                    0
JP JA              63                  0                   0                    6


The two responses labeled true by humans in CA_EN each have a Dissimilarity of 0 with
one of the true responses. The same holds for one response labeled true by humans in CA_FR.
Hence, it is not clear why these have an original automatic score of false, unless a space
or some other grapheme is missing that was not captured while extracting the text from the log file.
For CL_ES, only three out of 18 pass, while the rest have NDissimilarity ranging from 0.26 to 0.63.

Except for JP_JA, none of the false responses labeled by humans as false get an updated
automatic score of true. The six responses in Japanese have minimum NDissimilarity ranging
from 0.078 to 0.215.

Table 16 summarizes the results. 

Table 16

Item 4: How Many False Get Corrected Properly to True? (Summary)

Item       Training: # of positive     Evaluation   Accuracy   Precision   Recall   True negative
           instances (# of negative                                                 rate
           instances)

Item 4     2(29)                       369          0.95       0.76        0.55     0.98
Item 4 a                                            0.95       1           0.55     1

a Accuracy measures recalculated without Japanese responses. There would be no difference in
the result if we were to exclude the Japanese responses.


For the four items in the four languages and the eight countries considered, a nontrivial
percentage of responses whose original automatic score is false are properly corrected to true,
even though each item has very different expected evidence and each human rater might have
been more strict or more lenient (no responses were double scored, so human-human agreement
could not be compared). However, as the precision and recall tables show, this is not the whole
story. In one particular case (Item 2), the number of false positives is nontrivial. The questions
then are: (a) Is it better to err on the positive side or the negative side, that is, is it better to
grant more true scores than false ones? and (b) Can the types of scoring rules associated with
this item type be categorized so that the LCS method is known to work better for some
categories than for others? Note also that the amount of training and evaluation data is very
small and that language-specific thresholds might hold better. For three of the four items, both
Precision and Recall are either one or very close to one; hence, we consider these excellent
results. That said, evaluations with more items and responses will reveal the strengths and
weaknesses of this approach.
Discussion

In the following, we describe some general observations and potential issues. These are
divided into human scoring issues and issues with linguistic and cultural content analysis
that affect both human and automatic scoring.

Human Scoring 

We found that human scoring was more challenging than expected. Deciding where one 
draws the line about what is considered false or true was not so straightforward, given the current 
content representation. 

For instance, we looked at a response that, translated to English, is

“‘That’s the bad news,’ says Sekular. A training session of as little as two hours
improves the test results of older people. Sustained training can work wonders.”

It was hard to decide whether to score this as true or false. One argument for true is that
the student responded correctly in the part “A training session of as little ... wonders.” The
additional part, which violates the maximum, namely, “That’s the bad news, says Sekular,”
appears in a very different column in the passage. Hence, it is unlikely that the student guessed
the correct response. In fact, the prompt specifically asks for some other person’s take on the
issue. On the other hand, the additional part has no semantic relevance or coherence with the
correct response.

The human scoring allowed us to categorize responses according to the following:

Highlighting Beyond or off the Maximum 

• What seems like a pure mechanical mistake: When a student seems to 
unintentionally omit or add and ends up with a response beyond the maximum. 

• What seems semantically legitimate: When a student highlights additional
information either as (a) a specific detail such as an illustration or an example
confirming the correct response, or (b) what seems to be a logical, coherent
continuation of a correct response but not as specific as an illustration or an
example.

• What is legitimately wrong: When a student highlights clearly wrong 
information—for instance, information that is incoherent with the maximum or 
highlighting the whole passage. 

Highlighting Insufficient or Inadequate Minimum 

• What seems like a pure mechanical mistake: Similar to the above case, a student 
omits or adds unintentionally and ends up with a response that is insufficient or 
inadequate. 

• What seems semantically legitimate: When the occurrence of a certain minimum 
in a certain position is not acceptable, though it does not seem there is any 
contradictory or unsuitable context. 

• What is legitimately wrong: When a response is clearly wrong, for instance,
“stress” as a response to “how to train your memory?”

Question Marks 

We use this label for cases where we are not sure why responses were scored as they
were. For example, some responses that are clearly correct were given a false label
automatically. This might have been due to missing spaces or to preprocessing steps that were
not captured in our extracted data.


A Deeper Content Analysis 

We will present our observations in five categories. 

Natural Language Translation-Based Issues 

It is possible that some meaning variations are introduced when translating from
one language to another. In other words, students responding in Language X might be
unintentionally favored over students responding in Language Y. The issues we have noticed
so far had to do with the selection of lexical entities by item developers. A lexical entity
comprises a word, a compound, or a multiword lexical entity. In the rest of the document, we
will just refer to it as a word.

Introducing ambiguities. This occurs when translating a passage from Language L1 to
Language L2. A word that belongs to the minimum correct response is used in L2 in more than
one position in the passage with the right context, while in L1 different words are used or the
same word is used with different implicit contexts. This repetition might increase the ambiguity
for students. For instance, in Japanese, the word for training is introduced in more than one
position in the passage with the right context. Hence, either response is correct, while the
rubrics specified one position to be correct and the other not. The reason is that in English
there was no such reoccurrence or ambiguity between the two occurrences.

Reducing ambiguities. This occurs when translating a passage from Language L1 to
Language L2. A word that belongs to the minimum correct response is used in L1 in more than
one position in the passage, while L2 uses different words for those occurrences, which might
reduce the ambiguity for students responding in L2. For
instance, in English, the word create occurs more than once in a certain passage, yet only one 
occurrence of the word is associated with creativity and abstract artistic talents like poetry. The 
same happens in Spanish and French. When looking at the same passage in Japanese, two 
different words were used—one associated with abstract objects and one associated with 
concrete objects. A student responding in Japanese is more likely to select the word associated 
with abstract things when asked about art, while the students in other languages will not have 
that extra resolution of ambiguity between the occurrences of create. 

Neutral but different translation. An example would be the use of numbers expressed 
in words rather than numerals, such as twenty versus 20. 

Content Word Order 

Denoting a subject in a sentence by S, a verb by V, and an object by O, languages
can be divided into SVO, SOV, VSO, VOS, OVS, and OSV. We are used to languages
with SVO order, but the most common order in the languages of the world, including
Japanese, is actually SOV. The remaining orders occur in very small percentages.

This might create issues because a content word such as a verb or a noun (object or
subject) might appear in the minimum. For example, consider the minimum evidence,
“electronic products,” and a response, “Japanese build electronic products smoothly and
elegantly.” In Japanese syntax, electronic products would not be placed between build and the
adverbs, but located earlier, so Japanese students might highlight “build smoothly.” Hence, their
response is correct but has an inadequate minimum (because the minimum evidence is
“electronic products”). However, either the meaning is implicit or they missed “electronic
products” by mistake.

Background, Cultural, or World Knowledge Variations 

There are variations that we attribute to background, cultural, and world knowledge. A 
few examples follow. 

The prefix used to place a phone call abroad varies from one country to another. Though
this is stated in the passage corresponding to one item, it is an issue that people in general have
when they move to a new country. In the United Kingdom, for instance, calling abroad requires
00, but in the United States, 011. Some students tend to highlight the 00 wherever it occurs in
the text, in addition to what is required; we hypothesize that this depends on the student’s
home country. The maximum in the scoring rules might have to take this into consideration.

Another instance is related to professions. The fact that architects are considered artists 
and not engineers in some countries might be a reason why many students highlight the fact that 
engineers are associated with architects in the text as a sign of being artistic. The same goes for 
the construction of structures such as bridges because it is creative and artistic. 

Same Language, Different Variations 

Within the same language, there were variations. Between the passage written in French
in France and the one in Canada, there were many variations in the translation. For instance,
wonders was translated as merveilles in France but as miracles in Canada. Sustained was kept
as soutenu in Canadian French, but régulier (regular) was used in France.

• French in France: Un entraînement régulier peut faire des merveilles.

• French in Canada: Un entraînement soutenu peut faire des miracles.

Many more variations existed. Consider another example in Spanish, such as the following
different translations or representations of the same text:

• Spanish in Chile: Sin embargo, la degeneración de la capacidad intelectual puede
prevenirse de verdad, incluso aunque no nos dediquemos a ejercicios
experimentales tan extremos. Como dice Lehrl, “cualquiera que esté interesado en
cosas nuevas mantendrá la materia gris.”

• Spanish in Spain (Castellano): La degeneración de la capacidad intelectual de
todos modos se puede evitar, incluso sin darse el lujo de considerar el
pensamiento como un ejercicio experimental extremo. Tal como lo señala Lehrl:
“cualquiera que esté interesado en nuevas cosas mantendrá la materia gris en
forma.”

Learning the threshold using only one type of French or Spanish might not be as accurate 
for evaluation. Only a broader evaluation effort can tell. 

Negation as Its Own Special Case 

This case is for items where the minimum includes a negation. We only focus on explicit 
negations that are clearly modified by not. In our study we had no such items, but the approach 
we used will have to be updated to take care of, for instance, not less than average instead of less 
than average in the example in Table 1. 
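
To illustrate why negation needs special care, the sketch below compares the two phrases at the grapheme level. It assumes, for illustration only, a dissimilarity normalized as 1 minus the LCS length divided by the length of the longer sequence, and it treats each character as one grapheme. The normalization actually used in this study may differ, and the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_dissimilarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 1 - lcs_length(a, b) / longest


minimum = "not less than average"   # minimum that includes a negation
response = "less than average"      # response with the opposite meaning

print(lcs_length(minimum, response))                          # 17
print(round(normalized_dissimilarity(minimum, response), 2))  # 0.19
```

Under this assumed normalization the two phrases differ by only about 0.19, which is below the 0.22 threshold selected for Item 4, so a response with the opposite meaning could pass on grapheme overlap alone.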

All of the above issues can be seen by looking at the data, but some can also be inferred
from the hierarchy of clusters (ranked in order). All types of issues apply to all languages, but
that is not the only thing involved: in some cases, the students’ responses can be clustered as
similar across languages. For instance, in one of the items, many students responding in English
and Japanese selected the same wrong response:

a training session of as little as two hours improves the test results of older people.
Sustained training can work wonders. He is now pressing ahead with intensive memory
exercises with a group of elders in good health who are intending to compete next year
against young memory whiz kids.

The high-ranked clusters that we can form from the proximity of responses to each other will
indicate whether and how the rubrics can be refined.
Conclusion

We used one sequence alignment technique in order to evaluate multilingual student 
responses in international assessment. We showed it is an enhancement of what already exists, 
that is, the symbolic Boolean logic rules implementation. 

In addition to automatic scoring, this solution plays an important role in clustering
students’ responses; consequently, it has a nontrivial impact on the refinement of the items
and/or their scoring guidelines. The strengths of such an approach are that (a) it is language
independent, (b) it is adaptive, and, most importantly, (c) though designed for a specific problem,
it is general enough for any educational task that can be transformed into a sequential one,
even beyond text and speech applications. To name a few: It can be used for a set of actions
expected from a student in simulations or learning trajectories as projected by a teacher,
inside an intelligent tutoring system, or even in a game, or it can simply be used for a set of
student clicks, button selections, or keyboard hits expected to reach a correct answer. This
solution provides flexibility for existing automatic scoring techniques and potentially could
provide more flexibility if coupled with statistical data mining techniques. The limitation of
such an approach is that it treats graphemes as the alphabet without taking deeper linguistic
features into account. Hence, a phenomenon like negation could be missed or would need
specific treatment, for example, by considering order and n consecutive graphemes as part of
the alphabet.

The LCS approach is only one possible solution or similarity measure, and within this
approach there are many choice points that could lead to interesting research studies,
such as the choice of the normalized dissimilarity, the approach to collapsing a row of the
(dis)similarity matrix, the selection of a threshold, and even, probably, weighting different
graphemes differently. Many choice points could be refined by additional data evaluation.
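
These choice points can be made concrete with a small sketch in which each one is an explicit parameter. Everything here is an assumption made for illustration: the normalization (1 minus the LCS length divided by the longer length), collapsing a row of the dissimilarity matrix with `min`, passing when the collapsed value is at most the threshold, the `rescore` name, and the truncated sample response are not taken from the study. Only the 0.09 threshold is the value reported for Item 3.

```python
from typing import Callable, Iterable, Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two grapheme strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_dissimilarity(a: str, b: str) -> float:
    """Choice point 1 (assumed form): 1 - LCS length / longer length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 1 - lcs_length(a, b) / longest


def rescore(
    response: str,
    true_responses: Sequence[str],
    threshold: float,  # choice point 3
    dissimilarity: Callable[[str, str], float] = normalized_dissimilarity,
    collapse: Callable[[Iterable[float]], float] = min,  # choice point 2
) -> bool:
    """Score a response true if its collapsed dissimilarity is within threshold."""
    row = [dissimilarity(response, ref) for ref in true_responses]
    return collapse(row) <= threshold


TRUE_RESPONSES = ["European Agency for Safety and Health at Work"]

# Hypothetical response missing its final grapheme: passes at 0.09.
print(rescore("European Agency for Safety and Health at Wor", TRUE_RESPONSES, 0.09))
# Hypothetical response missing most of the evidence: does not pass at 0.09.
print(rescore("Agency and Health", TRUE_RESPONSES, 0.09))
```

Swapping `collapse` for a mean, or `dissimilarity` for a weighted variant, changes which responses pass without touching the rest of the sketch, which is one way to compare the alternatives empirically.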

Our next steps will include trying several evaluation techniques, such as treating each
language separately using bigger datasets, defining and comparing to a baseline, and considering
additional attributes like the position of the characters in the passages. Also, this approach is
being implemented operationally, and an evaluation with more items and a larger number of
responses will be available soon that will help us, empirically, in our measure selection.
Furthermore, we need to conduct a comparative evaluation with a different alignment technique
and with various algorithms for finding optimal alignments. These algorithms probably will be
borrowed from the field of bioinformatics, in particular for selecting an optimal alignment.
Finally, it is important to note that the size of the datasets in our case, as compared to the big
data tasks that exist in bioinformatics or natural language processing, is very small.
Notes


1 We will not make a distinction between character-based, alphabet-based, abjad-based,
syllable-based, abugida-based languages, and so forth. We will just use grapheme to
denote the smallest unit.

2 Proofs showing the correctness of such algorithms are not included in this document.

3 A programming language in continuous development since 1987; its main author is Jan
Wielemaker.